Quantization Fundamentals

Optimization
Author

Ritesh Kumar Maurya

Published

May 13, 2024

Handling Big Models

Current model compression techniques:

  1. Pruning: remove connections that do not improve the model.

  2. Knowledge Distillation: train a smaller model (student) using the original model (teacher). Con: you need enough hardware to fit both the teacher and the student.

Options to quantize:

Fig.1 - A layer of Neural Network.
  1. You can quantize the weights.

  2. You can quantize the activations that propagate through the layers of the neural network.

Idea: Store the parameters of the model in lower precision.

Data Types and Sizes

Integer

  1. Unsigned Integer (8-bit): range is [0, 255], i.e. [0, 2^n - 1] (all 8 bits are used to represent the number).

  2. Signed Integer (8-bit): range is [-128, 127], i.e. [-2^(n-1), 2^(n-1) - 1] (7 bits are used to represent the number and the 8th bit represents the sign; 0: positive, 1: negative).

| Data Type | torch.dtype | torch.dtype alias |
|---|---|---|
| 8-bit signed integer | torch.int8 | |
| 8-bit unsigned integer | torch.uint8 | |
| 16-bit signed integer | torch.int16 | torch.short |
| 32-bit signed integer | torch.int32 | torch.int |
| 64-bit signed integer | torch.int64 | torch.long |
  • You can use the code below to inspect the range of an integer dtype:
import torch
print(torch.iinfo(torch.int8))
# Output: iinfo(min=-128, max=127, dtype=int8)
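  • The same check works for the unsigned type (continuing from the snippet above, which already imported torch):
print(torch.iinfo(torch.uint8))
# Output: iinfo(min=0, max=255, dtype=uint8)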

Floating Point

There are 3 components in a floating point number:

  • Sign: positive/negative (always 1 bit)
  • Exponent (range): impacts the representable range of the number
  • Fraction (precision): impacts the precision of the number

  • FP32, BF16, FP16, and FP8 are floating point formats with a specific number of bits for the exponent and the fraction.
  1. FP32
  • Sign: 1 bit
  • Exponent (range): 8 bits
  • Fraction (precision): 23 bits
  • Total: 32 bits
  2. BF16
  • Sign: 1 bit
  • Exponent (range): 8 bits
  • Fraction (precision): 7 bits
  • Total: 16 bits
  3. FP16
  • Sign: 1 bit
  • Exponent (range): 5 bits
  • Fraction (precision): 10 bits
  • Total: 16 bits

Comparison Of Data Types

| Data Type | Precision | Maximum value |
|---|---|---|
| FP32 | Best | ~10^38 |
| FP16 | Better | ~10^4 |
| BF16 | Good | ~10^38 |


| Data Type | torch.dtype | torch.dtype alias |
|---|---|---|
| 16-bit floating point | torch.float16 | torch.half |
| 16-bit brain floating point | torch.bfloat16 | |
| 32-bit floating point | torch.float32 | torch.float |
| 64-bit floating point | torch.float64 | torch.double |
import torch
print("By default, python stores float data in fp64")
value = 1/3
tensor_fp64 = torch.tensor(value, dtype = torch.float64)
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")

print(torch.finfo(torch.bfloat16))
Output:

By default, python stores float data in fp64
fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
fp32 tensor: 0.333333343267440795898437500000000000000000000000000000000000
fp16 tensor: 0.333251953125000000000000000000000000000000000000000000000000
bf16 tensor: 0.333984375000000000000000000000000000000000000000000000000000
finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)
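  • The maxima in the comparison table above can be verified with torch.finfo as well; a quick sketch:
import torch

# Print range/precision info for the floating point dtypes compared above
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16} max={info.max:.3e}  eps={info.eps:.3e}")

# FP32 and BF16 share an 8-bit exponent, hence the similar ~1e38 maximum;
# FP16's 5-bit exponent caps its maximum at 65504, but its 10-bit fraction
# gives it a smaller eps (finer precision) than BF16.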

PyTorch Downcasting

  • When a higher precision data type is converted to a lower precision data type, some information is lost (see the small demo after this list).

  • Advantages:

    • Reduced memory footprint
    • Increased compute speed (depends on the hardware)
  • Disadvantages:

    • Less precise
  • Use case:

    • Mixed precision training
      • Do the computation in smaller precision (FP16/BF16/FP8)
      • Store and update the weights in higher precision (FP32)
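  • A minimal sketch of downcasting and the resulting loss of information (the tensor here is just a random example):
import torch

# Create a random tensor in the default FP32 precision
tensor_fp32 = torch.rand(1000, dtype=torch.float32)

# Downcast it to BF16: same exponent range, far fewer fraction bits
tensor_bf16 = tensor_fp32.to(dtype=torch.bfloat16)

# The aggregate statistics drift slightly because every value lost precision
print(f"fp32 mean: {tensor_fp32.mean().item()}")
print(f"bf16 mean: {tensor_bf16.mean().item()}")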

Loading Models by data type

  • target_dtype = torch.float16 or torch.bfloat16

  • model = model.to(target_dtype)

  • model = model.half() for fp16

  • model = model.bfloat16() for bfloat16

  • Prefer bfloat16 over float16 when using PyTorch on CPU.

  • FP32 is the default dtype in PyTorch.

  • model.get_memory_footprint()/1e+6 gives the model size in MB (for Hugging Face transformers models).

  • torch.set_default_dtype(desired_dtype) # by doing so we can load the model directly in the desired dtype, without loading it in full precision first and then casting it

  • Set the default dtype back to float32 afterwards to avoid unexpected behavior.
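  • A sketch of loading a Hugging Face model directly in BF16 (the model name here is only an example; any transformers model works the same way):
import torch
from transformers import AutoModel

# Load the model with its weights cast to bfloat16 instead of the default FP32
model = AutoModel.from_pretrained("google/flan-t5-small", torch_dtype=torch.bfloat16)

# Memory footprint in MB; roughly half of what the FP32 version would need
print(model.get_memory_footprint() / 1e+6, "MB")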

Quantization Theory

  • Quantization refers to the process of mapping a large set to a smaller set of values.

  • How do we convert the FP32 weights to INT8 without losing too much information?

    • It is done using a linear mapping, defined by two parameters:
    • s = scale
    • z = zero point
  • How do we get back our original tensor from the quantized tensor?

    • We can't recover the original tensor exactly, but we can dequantize using the same linear relationship that was used to quantize it; the difference is the quantization error.
Fig.2 - Comparison of tensors.

Quantize Using Quanto Library

  • from quanto import quantize, freeze
  • quantize(model, weights = desired_dtype, activations = desired_dtype)
  • freeze(model)
  • quantize creates an intermediate state of the model
  • after calling freeze, we get the quantized weights
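  • A minimal sketch of this workflow, assuming the quanto package used in the course is installed (the tiny model and the int8 target are just illustrative choices):
import torch
from quanto import quantize, freeze

# A tiny stand-in model; in practice this would be a loaded transformers model
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))

# Replace the model's linear layers with quantization-aware versions (intermediate state)
quantize(model, weights=torch.int8, activations=None)

# (Optional) run calibration or quantization-aware training here,
# while the model is still in its intermediate state

# freeze converts the weights to their quantized int8 representation
freeze(model)
print(model)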

Uses of the Intermediate State

  • Calibration
    • Calibrate the model when quantizing the activations of the model.
      • Range of activation values depends on what input was given. (e.g. different input text will generate different activations)
      • Min/Max of activation ranges are used to perform linear quantization.
      • How to get min and max range of activations?
        • Gather sample input data.
        • Run inference.
        • Calculate min/max of activations
    • Result: better quantized activations
  • Quantization Aware Training
    • Training in a way that controls how the model performs once it is quantized.
    • Intermediate state holds both the quantized and the unquantized weights.
    • Use the quantized version of the model in the forward pass (e.g. BF16)
    • Update the original, unquantized version of the model weights during backpropagation (e.g. FP32)
  • In L4 there is a function in helper.py to calculate the model size (a rough equivalent is sketched below)
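  • A rough, hypothetical re-implementation of such a helper (not the course's helper.py, just a sketch that sums parameter and buffer sizes):
import torch

def compute_model_size_mb(model: torch.nn.Module) -> float:
    """Approximate in-memory size of a model's parameters and buffers, in MB."""
    size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    size_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return size_bytes / 1e+6

# Example: a small linear layer stored in FP32 vs FP16
layer = torch.nn.Linear(1024, 1024)
print(compute_model_size_mb(layer))         # ~4.2 MB in FP32
print(compute_model_size_mb(layer.half()))  # ~2.1 MB in FP16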

Linear Quantization

Even though it looks very simple, it is used in many SOTA quantization methods:

  • AWQ: Activation-aware Weight Quantization
  • GPTQ: GPT Quantized
  • BNB: BitsandBytes Quantization
Fig.3 - Range
  • Simple idea: linear mapping
  • r = s * (q - z)
    • where
    • r: original value (e.g. in FP32)
    • s: scale (e.g. in FP32)
    • q: quantized value (e.g. in INT8)
    • z: zero point (e.g. in INT8)
  • How do we get the scale and the zero point?
    • s = (r_max - r_min) / (q_max - q_min)
    • z = int(round(q_min - r_min / s))
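  • A self-contained sketch of these formulas for per-tensor INT8 quantization (function names like linear_quantize are my own, not from a library; the degenerate case of a constant tensor is ignored):
import torch

def linear_quantize(r: torch.Tensor, dtype=torch.int8):
    # q_min/q_max come from the target integer type, e.g. [-128, 127] for int8
    q_min, q_max = torch.iinfo(dtype).min, torch.iinfo(dtype).max
    r_min, r_max = r.min().item(), r.max().item()

    # s = (r_max - r_min) / (q_max - q_min),  z = int(round(q_min - r_min / s))
    s = (r_max - r_min) / (q_max - q_min)
    z = int(round(q_min - r_min / s))

    # Round to the nearest integer and clamp so every value fits the int8 range
    q = torch.clamp(torch.round(r / s + z), q_min, q_max).to(dtype)
    return q, s, z

def linear_dequantize(q: torch.Tensor, s: float, z: int) -> torch.Tensor:
    # r ≈ s * (q - z): we only recover an approximation of the original tensor
    return s * (q.to(torch.float32) - z)

r = torch.randn(4, 4)
q, s, z = linear_quantize(r)
r_hat = linear_dequantize(q, s, z)
print("max quantization error:", (r - r_hat).abs().max().item())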

Quantization of LLMs

Recent SOTA quantization methods:

  • LLM.INT8 (only 8-bit)
    • Decomposes the matmul into two stages (the outlier part in float16 and the non-outlier part in int8).
  • QLoRA (only 4-bit)
    • Quantizes the base model and fine-tunes adapters on top of it
  • AWQ
  • GPTQ
  • SmoothQuant

More recent SOTA quantization methods for 2-bit quantization

  • QuIP#
  • HQQ
  • AQLM

All are open-source

Some Quantization Methods require calibration (as described above)

Some Quantization Methods require Adjustments

Many of these methods were originally applied to LLMs, but they can be applied to other types of models by making a few adjustments to the quantization method.

  • Some methods can be applied without making adjustments

    • Linear quantization
    • LLM.INT8
    • QLoRA
    • HQQ
  • Other approaches are data-dependent

  • There are users on Hugging Face who distribute quantized versions of popular models (e.g. TheBloke).

  • Check out the Hugging Face Open LLM Leaderboard to see how these quantized models perform.

  • Benefits of fine-tuning a quantized model:

    • Recover the accuracy from quantization
    • Tailor your model for specific use-cases and applications
  • Fine-tune with Quantization Aware Training (QAT)

    • Not compatible with Post Training Quantization (PTQ) techniques.
    • The linear quantization method is an example of PTQ.
    • PEFT + QLoRA (a sketch follows this list)
      • QLoRA quantizes the pre-trained base weights to 4-bit precision.
      • This matches the precision of the LoRA adapter weights.
      • This allows the model to add the activations of the pre-trained weights and the adapter weights.
      • This sum of the two activations can be fed as the input to the next layer of the network.
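  • A hedged sketch of this setup with the transformers, bitsandbytes and peft libraries; the model name, LoRA rank and target module names are placeholders chosen for illustration:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with its weights quantized to 4-bit (NF4) via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # precision used when dequantizing for compute
)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=bnb_config)

# Attach small trainable LoRA adapters; only these are updated during fine-tuning
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # placeholder module names; depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()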